Interrogating sex as a biological variable in preterm birth inequity
BMIN503/EPID600 Final Project
Author
Rose Albert
1 Interrogating sex as a biological variable in preterm birth inequity
December 13, 2024
By Rose Albert
1.1 Overview
Preterm birth affects approximately 10% of all births, and there is a well-established male vulnerability and female resilience during the neonatal period. The goal of this project is to interrogate sex as a biological variable using county-level data from the National Vital Statistics System queried using CDC WONDER (Wide-Ranging Online Data for Epidemiologic Research). This project will assess preterm-birth rates and NICU admissions. I have spoken with two experts in perinatal health, Dr. Aimin Chen and Dr. Heather Burris, about biological and structural determinants contributing to preterm birth inequity, as well as potential areas of clinical, public health, and policy intervention.
More than a third of infant deaths in the US are preterm-related (Callaghan et al., 2006). Preterm birth can predispose infants to immediate risks in the neonatal period as well as long-term adverse sequelae. For example, preterm infants are born with underdeveloped lungs and often require therapeutic interventions such as surfactant administration, antibiotics, steroids, mechanical ventilation, and supplemental oxygen for survival (Thebaud et al., 2019). However, these therapeutics can also pose postnatal insults and contribute to lung injury and chronic diseases such as bronchopulmonary dysplasia (Thebaud et al., 2019). Despite advances in neonatal care that have improved overall survival of preterm infants, rates of bronchopulmonary dysplasia have continued to rise (Bell et al., 2022). Bronchopulmonary dysplasia is one of many male-biased complications of prematurity (van Westering-Kroon et al. 2021). In addition to a male disadvantage in neonatal outcomes, there is also a male vulnerability for pregnancy complications and indications for preterm birth (Inkster et al., 2021).
Interrogation of sex as a biological variable has received increasing attention as a critical area for understanding basic pathophysiology and differential response to therapeutics (Albert, Lee, & Lingappan, 2023). The NIH has also established consideration of sex as a biological variable as a research mandate (“NOT-OD-15-102”, 2015). This project will investigate male susceptibility to preterm birth and NICU admission using publicly available county-level data. An understanding of preterm birth characteristics can inform neonatal care toward precision medicine approaches.
1.3 Methods
Data sources
County-level Birth Characteristics Data Source: I am using the CDC Wonder Natality Data (2023) grouped by “County of Residence,” “Sex of Infant,” “NICU Admission,” and “OE Gestational Age Recode 10,” with the measures “Births” and “Percent of Total Births.”
Mapping Data Source: I am drawing from Assignment 5 to use geospatial data, tidy census, and leaflet to generate interactive maps.update.packages(“tidycensus”)
Data curation and cleaning
To work with this dataset, I loaded the necessary libraries and merged the county population data with the birth data by “County.of.Residence.Code”. To clean the data, I am removing redundant columns and renaming my variables. I am also creating a new variable for preterm birth by grouping the gestational ages that are <37 weeks. I then merge this with my county shape files.
#load necessary librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
library(tidycensus)library(leaflet)options(tigris_use_cache =TRUE)options(progress_enabled =FALSE)#Load county-level birth data downloaded from CDC Wonderbirth_data <-read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded_sex.txt")county_data <-read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded%20_county.txt")#view column namescolnames(birth_data)
#merge birth_data and county_datamerged_data <-left_join(birth_data, county_data, by ="County.of.Residence.Code")
Warning in left_join(birth_data, county_data, by = "County.of.Residence.Code"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 11803 of `x` matches multiple rows in `y`.
ℹ Row 250 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
#clean merged_data library(dplyr)# clean merged data by removing unnecessary columns, checking discrepancies, and renamingcleaned_data <- merged_data %>%# Remove 'Notes.x' columnselect(-Notes.x) %>%# Check for discrepancies between 'County.of.Residence.x' and 'County.of.Residence.y'# mutate(County.of.Residence =ifelse( County.of.Residence.x == County.of.Residence.y, County.of.Residence.x, NA ) ) %>%# Merge 'County.of.Residence.x' and 'County.of.Residence.y'select(-County.of.Residence.x, -County.of.Residence.y) %>%# Rename 'Births.y' to 'Total.Births'rename(Total.Births = Births.y ) %>%# Remove 'Notes.y' columnselect(-Notes.y) %>%# Remove sex of infant codeselect(-Sex.of.Infant.Code) %>%# Remove 'X..of.Total.Births.y' and 'X..of.Total.Births.x'select(-c(X..of.Total.Births.x, X..of.Total.Births.y)) %>%# Remove 'OE.Gestational.Age.Recode.10.Code'select(-OE.Gestational.Age.Recode.10.Code) %>%# Remove 'NICU.Admission.Code'select(-NICU.Admission.Code) %>%# Filter out rows where 'NICU.Admission' or 'OE.Gestational.Age.Recode.10' are "Unknown"filter(!NICU.Admission %in%c("Unknown or Not Stated", ""),!OE.Gestational.Age.Recode.10%in%c("Unknown or Not Stated", "") ) %>%# Remove any duplicatesdistinct()cleaned_data <- cleaned_data %>%mutate(County.of.Residence.Code =as.character(County.of.Residence.Code))# Convert 'County.of.Residence.Code' to character before mergingpreterm_rate_data <- cleaned_data %>%mutate(Is.Preterm = OE.Gestational.Age.Recode.10%in%c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),County.of.Residence.Code =as.character(County.of.Residence.Code) # Convert to character ) %>%group_by(County.of.Residence, County.of.Residence.Code) %>%# Keep 'County.of.Residence.Code' in the group_bysummarise(Preterm.Births =sum(ifelse(Is.Preterm, Births.x, 0), na.rm =TRUE),Preterm.Birth.Rate = (Preterm.Births / Total.Births) *100 ) %>%ungroup()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'County.of.Residence',
'County.of.Residence.Code'. You can override using the `.groups` argument.
#load census shapefile data and merge by county FIPs counties_sf <-counties(cb =TRUE, resolution ="20m") # cb = TRUE for a simplified geometry
Retrieving data for the year 2022
# Merge the cleaned data with the counties shapefilecounty_map_preterm_data <-left_join(counties_sf, preterm_rate_data, by =c("GEOID"="County.of.Residence.Code"))county_map_cleaned_data <-left_join(counties_sf, cleaned_data, by =c("GEOID"="County.of.Residence.Code"))# Ensure the 'Sex.of.Infant' column is available and that it's a factor or charactercleaned_data$Sex.of.Infant <-as.factor(cleaned_data$Sex.of.Infant)
To create a data frame to assess preterm birth rates, I stratified the preterm-birth data by sex and calculated the preterm birth rate for each county by dividing the number of preterm births by the total number of births in each county, multiplied by 100 to report as a percentage.
# Create a new data frame stratified by sexpreterm_sex_data <- cleaned_data %>%mutate(Is.Preterm = OE.Gestational.Age.Recode.10%in%c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),County.of.Residence.Code =as.character(County.of.Residence.Code) # Convert to character ) %>%group_by(County.of.Residence, County.of.Residence.Code, Sex.of.Infant) %>%summarise(Preterm.Births =sum(ifelse(Is.Preterm, Births.x, 0), na.rm =TRUE),Preterm.Birth.Rate = (Preterm.Births / Total.Births) *100 ) %>%ungroup()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'County.of.Residence',
'County.of.Residence.Code', 'Sex.of.Infant'. You can override using the
`.groups` argument.
Analysis plan
Exploratory analysis of data:
Generate county-level choropleth maps of birth rates ((births/total population)*1000 for each county).
Generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total population).
Generate bar graphs showing sex-stratified distribution of births by gestational age groups.
Generate bar graphs showing sex-stratified distribution of NICU admission rates by gestational age groups.
Conduct logistic regression to identify relationships between gestational age and sex with the outcome of NICU admission
Additional data analysis: Identify counties with highest and lowest preterm birth rates, and the highest and lowest NICU admission rates.
1.4 Results
First, I developed county-level choropleth maps of birth rates (live births per 1000 people) for the 496 counties represented in this dataset. I used this output to identify the county with the highest birth rate (Rockland County, NY; 19.4) and the lowest birth rate (Charlotte County, FL; 4.99). The average birth rate was 10.81.
The following object is masked from 'package:lubridate':
stamp
library(nhanesA)library(haven)library(viridis)
Loading required package: viridisLite
# Convert Birth.Rate to numeric (if necessary)county_map_cleaned_data$Birth.Rate <-as.numeric(county_map_cleaned_data$Birth.Rate)# Recheck the structure of the columnstr(county_map_cleaned_data$Birth.Rate)
num [1:11636] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NAcounty_map_cleaned_data <- county_map_cleaned_data %>%filter(!is.na(Birth.Rate))#load map thememap_theme <-function() {theme_minimal() +theme(axis.line =element_blank(), axis.text =element_blank(), axis.title =element_blank(),panel.grid =element_line(color ="white"), legend.key.size =unit(0.8, "cm"), legend.text =element_text(size =16), legend.title =element_text(size =16) )}#create my palettemyPalette <-colorRampPalette(viridis(100))# Generate the choropleth map for birth ratesbirth_rates_map <-ggplot(data = county_map_cleaned_data) +geom_sf(aes(fill = Birth.Rate)) +# Map Preterm Birth Rate to the fillmap_theme() +# Apply the custom themeggtitle("2023 Birth Rate (live births per 1000 people)") +scale_fill_gradientn(name ="Birth Rate\nrate", colours =myPalette(100))coord_sf(xlim =c(-79.5, -75), ylim =c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
aspect: function
backtransform_range: function
clip: on
crs: NULL
datum: crs
default: FALSE
default_crs: NULL
determine_crs: function
distance: function
expand: TRUE
fixup_graticule_labels: function
get_default_crs: function
is_free: function
is_linear: function
label_axes: list
label_graticule:
labels: function
limits: list
lims_method: cross
modify_scales: function
ndiscr: 100
params: list
range: function
record_bbox: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
setup_data: function
setup_layout: function
setup_panel_guides: function
setup_panel_params: function
setup_params: function
train_panel_guides: function
transform: function
super: <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
# Print the mapprint(birth_rates_map)
#How many counties are represented in this analysis?num_unique_geoid_counties <-length(unique(county_map_cleaned_data$GEOID))print(num_unique_geoid_counties)
[1] 496
#Which counties have the highest and lowest birth rates?# Find the county with the highest Birth Rate and its value (show only the first entry)highest_birth_rate <- county_map_cleaned_data %>%filter(Birth.Rate ==max(Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the highest Birth Rate: ", highest_birth_rate$County.of.Residence, "\n")
County with the highest Birth Rate: Rockland County, NY
# Find the county with the lowest Birth Rate and its value (show only the first entry)lowest_birth_rate <- county_map_cleaned_data %>%filter(Birth.Rate ==min(Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the lowest Birth Rate: ", lowest_birth_rate$County.of.Residence, "\n")
County with the lowest Birth Rate: Charlotte County, FL
# Calculate the average Birth Rateaverage_birth_rate <- county_map_cleaned_data %>%summarise(Average_Birth_Rate =mean(Birth.Rate, na.rm =TRUE))cat("Average Birth Rate: ", average_birth_rate$Average_Birth_Rate, "\n")
Average Birth Rate: 10.81442
To better visualize and explore the birth rates, I then generated an interactive choropleth map of birth rates by county. This made it easier to identify regional trends and easily navigate to counties of interest such as Philadelphia County. Philadelphia County has a birth rate above average at 11.8 live births per 1000 people. My hometown in Hamilton County, TN has a similar birth rate at 11.9 live births per 1000 people.
#generate interactive choropleth map library(leaflet)library(RColorBrewer)# Define the color scale for the Birth Rate using colorBinpal_fun <-colorBin(palette =brewer.pal(9, "YlOrRd")[c(1:5, 7)], bins =c(0, 5, 10, 15, 20, 25), reverse =FALSE)# Create the popup message for each countypu_message <-paste0(county_map_cleaned_data$County.of.Residence, "<br>Birth Rate: ", round(county_map_cleaned_data$Birth.Rate, 1), " live births per 1000 people")# Create the leaflet map with the polygonsleaflet(county_map_cleaned_data) |>addPolygons(stroke =FALSE, fillColor =~pal_fun(Birth.Rate),fillOpacity =0.7, smoothFactor =0.5,popup = pu_message) |>addProviderTiles(providers$CartoDB.Positron) |>addLegend("bottomright", pal = pal_fun, values =~Birth.Rate, title ='Birth Rate', opacity =1) |>addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'
I repeated the analysis to generate county-level choropleth maps of preterm birth rates ((<37 gestational age births/total number of live births)*100). The average preterm birth rate of the counties in this dataset is 9.5%. Hinds County, MS has the highest preterm birth rate (17.85%) while Tompkins County, NY has the lowest preterm birth rate (2.63%).
# Convert Preterm.Birth.Rate to numericcounty_map_preterm_data$Preterm.Birth.Rate <-as.numeric(county_map_preterm_data$Preterm.Birth.Rate)# Recheck the structure of the columnstr(county_map_preterm_data$Preterm.Birth.Rate)
num [1:11636] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NAcounty_map_preterm_data <- county_map_preterm_data %>%filter(!is.na(Preterm.Birth.Rate))# Generate the choropleth map for preterm birth ratespreterm_birth_rates_map <-ggplot(data = county_map_preterm_data) +geom_sf(aes(fill = Preterm.Birth.Rate)) +# Map Preterm Birth Rate to the fillmap_theme() +# Apply the custom themeggtitle("2023 Preterm Birth Rate % (<37 gestational age births/total number of live births)") +scale_fill_gradientn(name ="Preterm Birth Rate\nrate", colours =myPalette(100))coord_sf(xlim =c(-79.5, -75), ylim =c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
aspect: function
backtransform_range: function
clip: on
crs: NULL
datum: crs
default: FALSE
default_crs: NULL
determine_crs: function
distance: function
expand: TRUE
fixup_graticule_labels: function
get_default_crs: function
is_free: function
is_linear: function
label_axes: list
label_graticule:
labels: function
limits: list
lims_method: cross
modify_scales: function
ndiscr: 100
params: list
range: function
record_bbox: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
setup_data: function
setup_layout: function
setup_panel_guides: function
setup_panel_params: function
setup_params: function
train_panel_guides: function
transform: function
super: <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
preterm_birth_rates_map
#How many counties are represented in this analysis?num_unique_geoid_counties_preterm <-length(unique(county_map_preterm_data$GEOID))print(num_unique_geoid_counties_preterm)
[1] 496
#Which counties have the highest and lowest preterm birth rates?# Find the county with the highest preterm Birth Rate and its value (show only the first entry)highest_preterm_birth_rate <- county_map_preterm_data %>%filter(Preterm.Birth.Rate ==max(Preterm.Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the highest preterm Birth Rate: ", highest_preterm_birth_rate$County.of.Residence, "\n")
County with the highest preterm Birth Rate: Hinds County, MS
# Find the county with the lowest preterm Birth Rate and its value (show only the first entry)lowest_preterm_birth_rate <- county_map_preterm_data %>%filter(Preterm.Birth.Rate ==min(Preterm.Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the lowest preterm Birth Rate: ", lowest_preterm_birth_rate$County.of.Residence, "\n")
County with the lowest preterm Birth Rate: Tompkins County, NY
#What is the average preterm birth rate? # Calculate the average preterm birth rateaverage_preterm_birth_rate <-mean(county_map_preterm_data$Preterm.Birth.Rate, na.rm =TRUE)cat("Average preterm birth rate: ", round(average_preterm_birth_rate, 2), "\n")
Average preterm birth rate: 9.5
I now want a similarly interactive map of preterm birth data. From this, I can identify that Philadelphia has an above average preterm birth rate at 11%. My hometown in Hamilton County, TN has a similar preterm birth rate at 10.6%.
#generate interactive choropleth map library(leaflet)library(RColorBrewer)# Define the color scale for the Birth Rate using colorBinpal_fun <-colorBin(palette =brewer.pal(9, "YlOrRd")[c(1:5, 7)], bins =c(0, 5, 10, 15, 20, 25), reverse =FALSE)county_map_preterm_data$Preterm.Birth.Rate <-as.numeric(county_map_preterm_data$Preterm.Birth.Rate)# Create the popup message for each countypu_message <-paste0(county_map_preterm_data$County.of.Residence, "<br>Preterm Birth Rate: ", round(county_map_preterm_data$Preterm.Birth.Rate, 1), "%")# Create the leaflet map with the polygonsleaflet(county_map_preterm_data) |>addPolygons(stroke =FALSE, fillColor =~pal_fun(Preterm.Birth.Rate),fillOpacity =0.7, smoothFactor =0.5,popup = pu_message) |>addProviderTiles(providers$CartoDB.Positron) |>addLegend("bottomright", pal = pal_fun, values =~Preterm.Birth.Rate, title ='Preterm Birth Rate', opacity =1) |>addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'
I now want to interrogate the total number of births by gestational age group and visualize the distribution using a bar graph. I can see that most of the infants are born at 32-35 weeks and 37-39 weeks gestational age. The <20 week gestational age group is least represented.
# Summarize the data by gestational age for total birthsgestational_births_summary <- cleaned_data %>%group_by(OE.Gestational.Age.Recode.10) %>%summarise(Total_Births =n() # Count the total number of births in each gestational age category ) %>%ungroup()# Plot the total number of births by gestational ageggplot(gestational_births_summary, aes(x = OE.Gestational.Age.Recode.10, y = Total_Births, fill = OE.Gestational.Age.Recode.10)) +geom_bar(stat ="identity", show.legend =FALSE) +labs(title ="Total Number of Births by Gestational Age",x ="Gestational Age at Birth",y ="Total Number of Births" ) +theme(axis.text.x =element_text(angle =45, hjust =1))
Now I want to explore the proportion of NICU admission by gestational age. We can see a general trend where lower gestational ages have a higher proportion of NICU admission. Notably, infants less than 20 weeks old do not follow this trend, though this may represent survivor bias (infants born less than 20 weeks old may not survive for very long after birth or are inadequately powered to be assessed in this analysis). Infants born 28-31 weeks gestational age had the highest proportion of NICU admission.
# Summarizing the data by gestational age and NICU admission statusgestational_nicu_summary <- cleaned_data %>%group_by(OE.Gestational.Age.Recode.10) %>%summarise(Total_Births =n(), # Count the total number of births in each gestational age category (both Yes and No)NICU_Admissions =sum(NICU.Admission =="Yes", na.rm =TRUE), # Count NICU admissionsProportion_NICU_Admission = NICU_Admissions / Total_Births # Calculate proportion of NICU admissions ) %>%ungroup()# Now, plot the proportion of NICU admissions by gestational agelibrary(ggplot2)ggplot(gestational_nicu_summary, aes(x = OE.Gestational.Age.Recode.10, y = Proportion_NICU_Admission, fill = OE.Gestational.Age.Recode.10)) +geom_bar(stat ="identity", show.legend =FALSE) +labs(title ="Proportion of NICU Admissions by Gestational Age",x ="Gestational Age at Birth",y ="Proportion of NICU Admissions" ) +theme(axis.text.x =element_text(angle =45, hjust =1))
I repeated this analysis stratified by sex, first by visualizing the number of births per gestational age.
gestational_nicu_sex_summary <- cleaned_data %>%group_by(OE.Gestational.Age.Recode.10, Sex.of.Infant) %>%summarise(Total_Births =n() # Count the total number of infants for each gestational age and sex ) %>%ungroup()
`summarise()` has grouped output by 'OE.Gestational.Age.Recode.10'. You can
override using the `.groups` argument.
# Print the summarized data to checkprint(gestational_nicu_sex_summary)
# A tibble: 18 × 3
OE.Gestational.Age.Recode.10 Sex.of.Infant Total_Births
<chr> <fct> <int>
1 20 - 27 weeks Female 278
2 20 - 27 weeks Male 315
3 28 - 31 weeks Female 399
4 28 - 31 weeks Male 430
5 32 - 35 weeks Female 1102
6 32 - 35 weeks Male 1127
7 36 weeks Female 957
8 36 weeks Male 1042
9 37 - 39 weeks Female 1221
10 37 - 39 weeks Male 1235
11 40 weeks Female 892
12 40 weeks Male 937
13 41 weeks Female 720
14 41 weeks Male 745
15 42 weeks or more Female 110
16 42 weeks or more Male 134
17 Under 20 weeks Female 3
18 Under 20 weeks Male 6
library(ggplot2)ggplot(gestational_nicu_sex_summary, aes(x = OE.Gestational.Age.Recode.10, y = Total_Births, fill = Sex.of.Infant)) +geom_bar(stat ="identity", position ="dodge") +# Use 'dodge' for separate bars for each sexlabs(title ="Number of Infants Born by Gestational Age and Sex",x ="Gestational Age at Birth",y ="Total Number of Infants" ) +theme(axis.text.x =element_text(angle =45, hjust =1))
I then stratified this data by sex to visualize gestational age and proportion of NICU admissions.
gestational_nicu_sex_summary <- cleaned_data %>%group_by(OE.Gestational.Age.Recode.10, Sex.of.Infant) %>%summarise(Total_Births =n(), # Count the total number of births in each gestational age category and sexNICU_Admissions =sum(NICU.Admission =="Yes", na.rm =TRUE), # Count NICU admissionsProportion_NICU_Admission = NICU_Admissions / Total_Births # Calculate proportion of NICU admissions ) %>%ungroup()
`summarise()` has grouped output by 'OE.Gestational.Age.Recode.10'. You can
override using the `.groups` argument.
# Print the summarized data to checkprint(gestational_nicu_sex_summary)
# A tibble: 18 × 5
OE.Gestational.Age.Recode.10 Sex.of.Infant Total_Births NICU_Admissions
<chr> <fct> <int> <int>
1 20 - 27 weeks Female 278 226
2 20 - 27 weeks Male 315 258
3 28 - 31 weeks Female 399 360
4 28 - 31 weeks Male 430 391
5 32 - 35 weeks Female 1102 609
6 32 - 35 weeks Male 1127 618
7 36 weeks Female 957 337
8 36 weeks Male 1042 425
9 37 - 39 weeks Female 1221 595
10 37 - 39 weeks Male 1235 609
11 40 weeks Female 892 266
12 40 weeks Male 937 311
13 41 weeks Female 720 108
14 41 weeks Male 745 140
15 42 weeks or more Female 110 0
16 42 weeks or more Male 134 1
17 Under 20 weeks Female 3 0
18 Under 20 weeks Male 6 0
# ℹ 1 more variable: Proportion_NICU_Admission <dbl>
library(ggplot2)ggplot(gestational_nicu_sex_summary, aes(x = OE.Gestational.Age.Recode.10, y = Proportion_NICU_Admission, fill = Sex.of.Infant)) +geom_bar(stat ="identity", position ="dodge") +# Use 'dodge' for separate bars for each sexlabs(title ="Proportion of NICU Admissions by Gestational Age and Sex",x ="Gestational Age at Birth",y ="Proportion of NICU Admissions" ) +theme(axis.text.x =element_text(angle =45, hjust =1))
I then ran a logistic regression with gestational age only. Each variable was significant except for gestational age <20 weeks. The <20 weeks age group also has a high standard error (108.2). Moving forward, this will be excluded in the analysis.
# Convert NICU.Admission to binary (1 for "Yes", 0 for "No") cleaned_data$NICU.Admission <-ifelse(cleaned_data$NICU.Admission =="Yes", 1, 0)# Fit the logistic regression model again logistic_model <-glm(NICU.Admission ~ OE.Gestational.Age.Recode.10, data = cleaned_data, family ="binomial")# Summary of the logistic regression modelsummary(logistic_model)
Call:
glm(formula = NICU.Admission ~ OE.Gestational.Age.Recode.10,
family = "binomial", data = cleaned_data)
Coefficients:
Estimate Std. Error z value
(Intercept) 1.4907 0.1060 14.061
OE.Gestational.Age.Recode.1028 - 31 weeks 0.7740 0.1594 4.857
OE.Gestational.Age.Recode.1032 - 35 weeks -1.2882 0.1143 -11.275
OE.Gestational.Age.Recode.1036 weeks -1.9752 0.1156 -17.088
OE.Gestational.Age.Recode.1037 - 39 weeks -1.5298 0.1134 -13.485
OE.Gestational.Age.Recode.1040 weeks -2.2654 0.1174 -19.304
OE.Gestational.Age.Recode.1041 weeks -3.0815 0.1269 -24.289
OE.Gestational.Age.Recode.1042 weeks or more -6.9838 1.0076 -6.931
OE.Gestational.Age.Recode.10Under 20 weeks -14.0568 108.2480 -0.130
Pr(>|z|)
(Intercept) < 2e-16 ***
OE.Gestational.Age.Recode.1028 - 31 weeks 1.19e-06 ***
OE.Gestational.Age.Recode.1032 - 35 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1036 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1037 - 39 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1040 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1041 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1042 weeks or more 4.19e-12 ***
OE.Gestational.Age.Recode.10Under 20 weeks 0.897
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16042 on 11652 degrees of freedom
Residual deviance: 13837 on 11644 degrees of freedom
AIC: 13855
Number of Fisher Scoring iterations: 11
I repeated the analysis to exclude <20 weeks gestational age.
cleaned_data$OE.Gestational.Age.Recode.10<-factor(cleaned_data$OE.Gestational.Age.Recode.10, levels =c("40 weeks","20 - 27 weeks","28 - 31 weeks", "32 - 35 weeks", "36 weeks", "37 - 39 weeks", "41 weeks", "42 weeks or more"))# Fit the logistic regression model again logistic_model <-glm(NICU.Admission ~ OE.Gestational.Age.Recode.10, data = cleaned_data, family ="binomial")# Summary of the logistic regression modelsummary(logistic_model)
Call:
glm(formula = NICU.Admission ~ OE.Gestational.Age.Recode.10,
family = "binomial", data = cleaned_data)
Coefficients:
Estimate Std. Error z value
(Intercept) -0.77466 0.05032 -15.395
OE.Gestational.Age.Recode.1020 - 27 weeks 2.26539 0.11736 19.304
OE.Gestational.Age.Recode.1028 - 31 weeks 3.03935 0.12917 23.531
OE.Gestational.Age.Recode.1032 - 35 weeks 0.97723 0.06592 14.826
OE.Gestational.Age.Recode.1036 weeks 0.29016 0.06821 4.254
OE.Gestational.Age.Recode.1037 - 39 weeks 0.73556 0.06451 11.403
OE.Gestational.Age.Recode.1041 weeks -0.81606 0.08594 -9.496
OE.Gestational.Age.Recode.1042 weeks or more -4.71840 1.00135 -4.712
Pr(>|z|)
(Intercept) < 2e-16 ***
OE.Gestational.Age.Recode.1020 - 27 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1028 - 31 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1032 - 35 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1036 weeks 2.10e-05 ***
OE.Gestational.Age.Recode.1037 - 39 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1041 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1042 weeks or more 2.45e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16031 on 11643 degrees of freedom
Residual deviance: 13837 on 11636 degrees of freedom
(9 observations deleted due to missingness)
AIC: 13853
Number of Fisher Scoring iterations: 7
Using these data, I then calculated odds ratios for gestational age on the outcome of NICU admission. From these data, 28-31 weeks and 20-27 weeks had the highest positive association with NICU admission, Gestational age of 41 and 41 weeks or more had an odds ratio less than 1, indicating these infants are less likely to be admitted to the NICU.
# Extract model coefficientscoefficients <-summary(logistic_model)$coefficients# Calculate odds ratios (exponentiate the coefficients)odds_ratios <-exp(coefficients[, "Estimate"])lower_ci <-exp(coefficients[, "Estimate"] -1.96* coefficients[, "Std. Error"])upper_ci <-exp(coefficients[, "Estimate"] +1.96* coefficients[, "Std. Error"])# Add confidence intervals to the data frameodds_ratios_df <-data.frame(Gestational_Age =rownames(coefficients),Odds_Ratio = odds_ratios,Lower_CI = lower_ci,Upper_CI = upper_ci)# Remove the intercept rowodds_ratios_df <- odds_ratios_df[odds_ratios_df$Gestational_Age !="(Intercept)", ]print(odds_ratios_df)
library(ggplot2)ggplot(odds_ratios_df, aes(x =reorder(Gestational_Age, Odds_Ratio), y = Odds_Ratio)) +geom_pointrange(aes(ymin = Lower_CI, ymax = Upper_CI), color ="blue", shape =16) +# Add points with error barscoord_flip() +# Flip axes for readabilitygeom_hline(yintercept =1, linetype ="dashed", color ="red") +# Reference line at OR = 1labs(title ="Odds Ratios for NICU Admission",x ="Variable",y ="Odds Ratio (OR)" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) # Improve label readability
I repeated this analysis to conduct a logistic regression and calculate odds ratios for gestational age, this time including male sex. Being born male had a slight association with NICU admission (OR: 1.1, 95% CI: 1.022-1.20).
# Fit the logistic regression model with Gestational age and Sex of infantlogistic_model <-glm(NICU.Admission ~ OE.Gestational.Age.Recode.10+ Sex.of.Infant, data = cleaned_data, family ="binomial")# Show the model summarysummary(logistic_model)
Call:
glm(formula = NICU.Admission ~ OE.Gestational.Age.Recode.10 +
Sex.of.Infant, family = "binomial", data = cleaned_data)
Coefficients:
Estimate Std. Error z value
(Intercept) -0.82738 0.05466 -15.136
OE.Gestational.Age.Recode.1020 - 27 weeks 2.26476 0.11738 19.294
OE.Gestational.Age.Recode.1028 - 31 weeks 3.04023 0.12919 23.534
OE.Gestational.Age.Recode.1032 - 35 weeks 0.97852 0.06594 14.840
OE.Gestational.Age.Recode.1036 weeks 0.28941 0.06823 4.242
OE.Gestational.Age.Recode.1037 - 39 weeks 0.73698 0.06453 11.421
OE.Gestational.Age.Recode.1041 weeks -0.81606 0.08596 -9.494
OE.Gestational.Age.Recode.1042 weeks or more -4.72295 1.00135 -4.717
Sex.of.InfantMale 0.10198 0.04087 2.495
Pr(>|z|)
(Intercept) < 2e-16 ***
OE.Gestational.Age.Recode.1020 - 27 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1028 - 31 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1032 - 35 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1036 weeks 2.22e-05 ***
OE.Gestational.Age.Recode.1037 - 39 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1041 weeks < 2e-16 ***
OE.Gestational.Age.Recode.1042 weeks or more 2.40e-06 ***
Sex.of.InfantMale 0.0126 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16031 on 11643 degrees of freedom
Residual deviance: 13831 on 11635 degrees of freedom
(9 observations deleted due to missingness)
AIC: 13849
Number of Fisher Scoring iterations: 7
# Extract model coefficientscoefficients <-summary(logistic_model)$coefficients# Calculate odds ratios (exponentiate the coefficients)odds_ratios <-exp(coefficients[, "Estimate"])lower_ci <-exp(coefficients[, "Estimate"] -1.96* coefficients[, "Std. Error"])upper_ci <-exp(coefficients[, "Estimate"] +1.96* coefficients[, "Std. Error"])# Create a data frame of the odds ratiosodds_ratios_df <-data.frame(Variable =rownames(coefficients),Odds_Ratio = odds_ratios,Lower_CI = lower_ci,Upper_CI = upper_ci)# Remove the intercept rowodds_ratios_df <- odds_ratios_df[odds_ratios_df$Variable !="(Intercept)", ]# Print the odds ratios data frameprint(odds_ratios_df)
library(ggplot2)ggplot(odds_ratios_df, aes(x =reorder(Variable, Odds_Ratio), y = Odds_Ratio)) +geom_pointrange(aes(ymin = Lower_CI, ymax = Upper_CI), color ="blue", shape =16) +# Add points with error barscoord_flip() +# Flip axes for readabilitygeom_hline(yintercept =1, linetype ="dashed", color ="red") +# Reference line at OR = 1labs(title ="Odds Ratios for NICU Admission",x ="Variable",y ="Odds Ratio (OR)" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) # Improve label readability
To interrogate if males more likely to be preterm, I stratified the data into preterm (<37 weeks) and term gestational age groups and conducted a logistic regression. Male sex was not significant (p = 0.482).
# Create binary variable for preterm (1) vs term (0)cleaned_data$Preterm_Binary <-ifelse(cleaned_data$OE.Gestational.Age.Recode.10%in%c("Under 20 weeks", "20 - 27 weeks", "28 - 31 weeks", "32 - 35 weeks", "36 weeks"), 1, 0)# Logistic regression to test if males are more likely to be pretermlogistic_model <-glm(Preterm_Binary ~ Sex.of.Infant, data = cleaned_data, family ="binomial")summary(logistic_model)
Call:
glm(formula = Preterm_Binary ~ Sex.of.Infant, family = "binomial",
data = cleaned_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.07395 0.02655 -2.785 0.00535 **
Sex.of.InfantMale 0.02604 0.03708 0.702 0.48249
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16144 on 11652 degrees of freedom
Residual deviance: 16143 on 11651 degrees of freedom
AIC: 16147
Number of Fisher Scoring iterations: 3
I visualized number of infants born term and preterm stratified by sex using a bar graph.
library(ggplot2)# Summarize the data by Sex and Preterm statussummary_data <- cleaned_data %>%group_by(Sex.of.Infant, Preterm_Binary) %>%summarize(Count =n(), .groups ="drop")# Map Preterm_Binary to labels for the plotsummary_data$Preterm_Label <-ifelse(summary_data$Preterm_Binary ==1, "Preterm", "Term")# Create the bar plot with Gestational Age on the x-axisggplot(summary_data, aes(x = Preterm_Label, y = Count, fill = Sex.of.Infant)) +geom_bar(stat ="identity", position ="dodge") +# Dodge for side-by-side barslabs(title ="Number of Infants Born Term and Preterm Stratified by Sex",x ="Gestational Age",y ="Number of Infants",fill ="Sex of Infant" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), # Angle x-axis labels for better readabilitylegend.position ="top" )
To identify if males are more likely to be born extremely preterm (<28 wks), I conducted a logistic regression by combining the age groups <20 weeks and 20-27 weeks to create a binary outcome variable. Male sex was not significant (p = 0.344).
# Create a binary variable for extremely preterm (<28 weeks)cleaned_data$Extremely_Preterm <-ifelse(cleaned_data$OE.Gestational.Age.Recode.10=="Under 20 weeks"| cleaned_data$OE.Gestational.Age.Recode.10=="20 - 27 weeks", 1, 0)# Fit a logistic regression modelmodel <-glm(Extremely_Preterm ~ Sex.of.Infant, data = cleaned_data, family = binomial)# Summary of the modelsummary(model)
Call:
glm(formula = Extremely_Preterm ~ Sex.of.Infant, family = binomial,
data = cleaned_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.96672 0.06150 -48.242 <2e-16 ***
Sex.of.InfantMale 0.07988 0.08446 0.946 0.344
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4686.4 on 11643 degrees of freedom
Residual deviance: 4685.5 on 11642 degrees of freedom
(9 observations deleted due to missingness)
AIC: 4689.5
Number of Fisher Scoring iterations: 5
1.5 Conclusion
This study used natality data from CDC WONDER to interrogate preterm birth within a cohort of 11,653 infants across 496 counties. I conducted an exploratory analysis to generate interactive maps and identify counties with the highest (Hinds County, MS; 17.9 %) and lowest (Tompkins, NY; 2.6%) preterm birth rates. Given the known regional health disparities, it is unsurprising the highest preterm birth rates are found in the Deep South. This provides evidence for the need to implement a county-level intervention to address preterm birth inequity.
I used logistic regression to identify the associations of gestational age and sex in NICU admission. Gestational age of 28-31 weeks had the highest association with NICU admission (OR: 20.9, 95% CI: 16.23-26.9) while being male only slightly increased likelihood of NICU admission (OR: 1.1, 95% CI: 1.022-1.20). A gestational age of 41 weeks (OR: 0.44, 95% CI: 0.37-0.52) or >42 weeks (OR: 0.008, 95% CI: 0.0012-0.063) decreased the likelihood of NICU admission. These associations are in agreement with previous studies indicating that infants born at lower gestational ages are more likely to require therapeutic intervention (Thebaud et al., 2019).
Myriad biological and social determinants of health contribute to preterm birth outcomes. There are well-established regional health disparities, with the southeastern United States having the highest preterm birth rates (March of Dimes, 2024). Structural racism also contributes to preterm birth inequities, with Black infants having the highest probability of preterm birth (March of Dimes, 2024). Additional co-morbidities and environmental exposures contribute, such as hypertension, extreme heat, diabetes, and air pollution (March of Dimes, 2024). There are similar disparities for infant and maternal mortality in the United States (March of Dimes, 2024).
The objective of this study was to interrogate sex as a biological variable in NICU admissions. An understanding of sex differences is critical for precision medicine approaches. This will allow therapeutic intervention that is tailored to the right patient. Additionally, by understanding sex-related mechanisms that are advantageous or deleterious in a disease context, this may uncover therapeutic targets that can be beneficial for both sexes.
Future works can investigate additional neonatal outcomes and preterm birth indicators such as chorioamnionitis, bronchopulmonary dysplasia, retinopathy of prematurity, and preeclampsia. Integration of patient data can also consider maternal characteristics and health status.
1.6 References
Albert, R., Lee, A., Lingappan, K. Response to Therapeutic Interventions in the NICU: Role of Sex as a Biological Variable. 2023. Neoreviews, 24(12), e797-e805.
Bell, E.F., et al., Mortality, in-hospital morbidity, care practices, and 2-year outcomes for extremely preterm infants in the US, 2013-2018. JAMA, 2022. 327(3): p. 248-263.
Callaghan WM, MacDorman MF, Rasmussen SA, Qin C, Lackritz EM. The contribution of preterm birth to infant mortality rates in the United States. Pediatrics. 2006;118(4):1566-1573.
Inkster AM, Fernández-Boyano I, Robinson WP. Sex differences are here to stay: relevance to prenatal care. Journal of Clinical Medicine. 2021 Jul 5;10(13):3000.
NOT-OD-15-102: Consideration of Sex as a Biological Variable in NIH-Funded Research. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-15-102.html. Accessed 10 Dec. 2024.
Thebaud, B., et al., Bronchopulmonary dysplasia. Nat Rev Dis Primers, 2019. 5(1): p. 78.
van Westering-Kroon, E., et al., Male Disadvantage in Oxidative Stress-Associated Complications of Prematurity: A Systematic Review, Meta-Analysis and Meta-Regression. Antioxidants (Basel), 2021. 10(9).